How to Scrape Tripadvisor Reviews and Perform Sentiment Analysis with AI

Introduction

In today's digitally driven world, businesses thrive on data-driven insights to make informed decisions and provide exceptional customer experiences.
One goldmine of valuable information lies within online reviews, where travelers share their opinions and experiences with the world.

Tripadvisor, a renowned platform for travel enthusiasts, boasts a vast repository of user-generated reviews, offering a treasure trove of insights waiting to be unlocked.
However, analyzing these reviews manually can be a daunting task, especially for large corporations with thousands of reviews.
This is where scraping and sentiment analysis with AI can come in handy.

In this blog post, we will explore the step-by-step process of building a Tripadvisor scraper using Page2API, and then performing sentiment analysis on the extracted data using GPT-3.5-turbo.

By the end of this tutorial, you will have a better understanding of how to leverage AI tools to extract valuable insights from online reviews and improve your business's reputation.

Prerequisites

To start scraping Tripadvisor reviews, we will need the following things:

A Page2API account
An OpenAI account
A Tripadvisor location (let's say a hotel) that we are interested in.
In our case, the hotel will be NH City Centre Amsterdam.
Some basic Ruby coding skills.

How to scrape Tripadvisor reviews

First what we need is to open the chosen hotel's Tripadvisor reviews page.

The exact URL will be:

  
    https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews-NH_City_Centre_Amsterdam-Amsterdam_North_Holland_Province.html#REVIEWS

We will use this URL as the first parameter we need to start the scraping process.

The page that you see must look like the following one:

From the Tripadvisor reviews page, we will scrape the following attributes from each review:

Title
Content

Now, let's define the selectors for each attribute.

  
    /* Parent: */
    [data-reviewid]

    /* Title: */
    [data-test-target=review-title]

    /* Content: */
    [data-test-target=review-title] + div > div div span

Let's handle the pagination.
There are two approaches that can help us scrape all the needed pages:

1. We can scrape the pages using the batch scraping feature
2. We can iterate through the pages by clicking on the Next page button

To keep the article short enough, we will only cover the batch approach.

Now it's time to build the request that will scrape Tripadvisor reviews.

The following examples will show how to scrape 3 pages of reviews from Tripadvisor.com

With the batch scraping approach, our payload will look like:

  
    {
      "api_key": "YOUR_PAGE2API_KEY",
      "batch": {
        "urls": [
          "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews.html#REVIEWS",
          "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews-or10.html#REVIEWS",
          "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews-or20.html#REVIEWS"
        ],
        "merge_results": true,
        "concurrency": 1
      },
      "real_browser": true,
      "premium_proxy": "us",
      "wait_for": "[data-reviewid]",
      "parse": {
        "reviews": [
          {
            "_parent": "[data-reviewid]",
            "title": "[data-test-target=review-title] >> text",
            "content": "[data-test-target=review-title] + div > div div span >> text"
          }
        ]
      }
    }

Code examples (batch scraping approach)

      
    require 'rest_client'
    require 'json'

    api_url = "https://www.page2api.com/api/v1/scrape"
    payload = {
      api_key: "YOUR_PAGE2API_KEY",
      batch: {
        urls: [
          "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews.html#REVIEWS",
          "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews-or10.html#REVIEWS",
          "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews-or20.html#REVIEWS"
        ],
        merge_results: true,
        concurrency: 1
      },
      real_browser: true,
      premium_proxy: "us",
      wait_for: "[data-reviewid]",
      parse: {
        reviews: [
          {
            _parent: "[data-reviewid]",
            title: "[data-test-target=review-title] >> text",
            content: "[data-test-target=review-title] + div > div div span >> text"
          }
        ]
      }
    }

    response = RestClient::Request.execute(
      method: :post,
      payload: payload.to_json,
      url: api_url,
      headers: { "Content-type" => "application/json" },
    ).body

    result = JSON.parse(response)

    puts(result)

      
    import requests
    import json

    api_url = 'https://www.page2api.com/api/v1/scrape'
    payload = {
      "api_key": "YOUR_PAGE2API_KEY",
      "batch": {
        "urls": [
          "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews.html#REVIEWS",
          "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews-or10.html#REVIEWS",
          "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews-or20.html#REVIEWS"
        ],
        "merge_results": True,
        "concurrency": 1
      },
      "real_browser": True,
      "premium_proxy": "us",
      "wait_for": "[data-reviewid]",
      "parse": {
        "reviews": [
          {
            "_parent": "[data-reviewid]",
            "title": "[data-test-target=review-title] >> text",
            "content": "[data-test-target=review-title] + div > div div span >> text"
          }
        ]
      }
    }

    headers = {'Content-type': 'application/json', 'Accept': 'text/plain'}
    response = requests.post(api_url, data=json.dumps(payload), headers=headers)
    result = json.loads(response.text)

    print(result)

      
    <?php

    $api_url = "https://www.page2api.com/api/v1/scrape";
    $payload = [
      "api_key" => "YOUR_PAGE2API_KEY",
      "batch" => [
        "urls" => [
          "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews.html#REVIEWS",
          "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews-or10.html#REVIEWS",
          "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews-or20.html#REVIEWS"
        ],
        "concurrency" => 1,
        "merge_results" => true
      ],
      "real_browser" => true,
      "premium_proxy" => "us",
      "wait_for" => "[data-reviewid]",
      "parse" => [
        "reviews" => [
          "0" => [
            "title" => "[data-test-target=review-title] >> text",
            "_parent" => "[data-reviewid]",
            "content" => "[data-test-target=review-title] + div > div div span >> text"
          ]
        ]
      ]
    ];

    $postdata = json_encode($payload);
    $ch = curl_init($api_url);
    curl_setopt($ch,CURLOPT_POST, true);
    curl_setopt($ch,CURLOPT_POSTFIELDS, $postdata);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type:application/json'));
    curl_setopt($ch,CURLOPT_RETURNTRANSFER, true);
    $result = curl_exec($ch);
    curl_close($ch);

    echo $result;

    ?>

      
    const axios = require('axios');
    const api_url = 'https://www.page2api.com/api/v1/scrape';
    const payload = {
      api_key: 'YOUR_PAGE2API_KEY',
      batch: {
        urls: [
          "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews.html#REVIEWS",
          "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews-or10.html#REVIEWS",
          "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews-or20.html#REVIEWS"
        ],
        merge_results: true,
        concurrency: 1
      },
      real_browser: true,
      premium_proxy: "us",
      wait_for: "[data-reviewid]",
      parse: {
        reviews: [
          {
            _parent: "[data-reviewid]",
            title: "[data-test-target=review-title] >> text",
            content: "[data-test-target=review-title] + div > div div span >> text"
          }
        ]
      }
    };

    axios.post(api_url, payload)
      .then((res) => {
        console.log(JSON.stringify(res.data, null, "  "));
      }).catch((err) => {
         console.error(err);
      });

      
    curl -XPOST -H "Content-type: application/json" -d '{
      "api_key": "YOUR_PAGE2API_KEY",
      "batch": {
        "urls": [
          "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews.html#REVIEWS",
          "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews-or10.html#REVIEWS",
          "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews-or20.html#REVIEWS"
        ],
        "merge_results": true,
        "concurrency": 1
      },
      "real_browser": true,
      "premium_proxy": "us",
      "wait_for": "[data-reviewid]",
      "parse": {
        "reviews": [
          {
            "_parent": "[data-reviewid]",
            "title": "[data-test-target=review-title] >> text",
            "content": "[data-test-target=review-title] + div > div div span >> text"
          }
        ]
      }
    }' 'https://www.page2api.com/api/v1/scrape' | python3.10 -mjson.tool

The result

  
    {
      "result": {
        "reviews": [
          {
            "title": "Great Hotel in city centre",
            "content": "Our stay here was excellent! We were greeted on arrival by Frankie who helped us settle in and gave us wonderful site seeing tips. "
          },
          {
            "title": "Fantastic stay in this gem",
            "content": "Our family had a wonderful visit to Amsterdam, due in large part to this hotel. From the moment we arrived on property, we felt welcomed. ."
          },
          {
            "title": "Makes you trip memorable",
            "content": "We had an amazing stay at the hotel. In fact we ended cancelled another booking and booked this hotel last minute. "
          }, ...
        ]
      }, ...
    }

How to summarize the reviews and perform the Sentiment Analysis with AI (GPT-3.5-turbo)

In the following part of the article, we will:

Collect the scraped Tripadvisor reviews and clean them up a little bit.
Join the reviews into a single entity, separating each of them by a new line.
Build a GPT prompt.
Send the reviews content and the prompt to GPT.
Enjoy the results.

From the code perspective, we will:

Switch to Ruby. Because Ruby is cool and easy to read.
Separate the code into two classes to enhance the readability.
Provide the possibility to change the reviews page and the number of total pages dynamically.

Let's start by creating a new file (gpt.rb) with the following structure

  
  require 'rest_client'
  require 'json'

  class Page2APIParser
    def initialize(url, pages)
    end

    def perform
    end
  end

  class GPTAnalyzer
    def initialize(reviews_content)
    end

    def perform
    end
  end

  reviews_url = ARGV[0] || raise('The reviews URL was not provided!')
  pages = ARGV[1].to_i.nonzero? || 1

  page2api = Page2APIParser.new(reviews_url, pages)
  page2api.perform

  gpt = GPTAnalyzer.new(page2api.reviews_content)
  gpt.perform

  puts gpt.result

This is our main script

It receives 2 arguments: the Tripadvisor reviews page URL, and the number of total pages to scrape.
The script can be called from the terminal like in the following examples:

For one page

  
    $ ruby gpt.rb https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews-NH_City_Centre_Amsterdam-Amsterdam_North_Holland_Province.html

For multiple pages

  
    $ ruby gpt.rb https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews-NH_City_Centre_Amsterdam-Amsterdam_North_Holland_Province.html 2

Now let's use the code from the first part of the article and build the parser

  
  require 'rest_client'
  require 'json'

  class Page2APIParser
    API_KEY = ''

    attr_reader :url, :pages, :reviews_content

    def initialize(url, pages)
      @url = url
      @pages = pages
    end

    def perform
      response = RestClient::Request.execute(
        method: :post,
        payload: payload.to_json,
        url: 'https://www.page2api.com/api/v1/scrape',
        headers: { "Content-type" => "application/json" },
      ).body

      reviews = JSON.parse(response)

      # We will iterate through all the reviews and if the review title will be contained
      # in the review body (aka content) - it will be ignored.
      # Otherwise - it will be glued together with the review content.

      compacted_reviews = reviews.map do |review|
        title = review['title'].gsub('…', '')
        content = review['content']

        content.include?(title) ? content : "#{title}. #{content}"
      end

      @reviews_content = compacted_reviews.join("\n\n")
    end

    private

    def payload
      {
        api_key: API_KEY,
        batch: {
          urls: reviews_urls,
          concurrency: 1,
          merge_results: true
        },
        raw: {
          key: "reviews"
        },
        real_browser: true,
        premium_proxy: "us",
        wait_for: "[data-reviewid]",
        parse: {
          reviews: [
            {
              _parent: "[data-reviewid]",
              title: "[data-test-target=review-title] >> text",
              content: "[data-test-target=review-title] + div > div div span >> text"
            }
          ]
        }
      }
    end

    def reviews_urls
      return [url] if pages == 1

      (0..(10 * pages - 1)).step(10).to_a.map do |offset|
        url.gsub('-Reviews-', "-Reviews-or#{offset}-")
      end
    end
  end

You can test the parser by updating the API_KEY

  
    API_KEY = 'Your Page2API API key'

and running

  
    page2api = Page2APIParser.new('https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews-NH_City_Centre_Amsterdam-Amsterdam_North_Holland_Province.html, 1)
    page2api.perform

    puts page2api.reviews_content

The parser will generate the following content

  
    Great location & spacious rooms. We enjoyed our stay finding the rooms and facilities very comfortable and the location very convenient to walk about the city and get to the places we wanted to visit. We were surprised and grateful that the hotel gave us access to a room way before the check-in time (am) so we could take our bags to our room and head out for the day. My very tall husband appreciated the space in the room and shower after constantly hitting his head at our previous tiny accommodation.

    Know what to expect and you will enjoy. Very nice hotel for a short stay. Great location for walking to shopping, restaurants, bars, and other sights. Front desk staff were pleasant and helpful when the desk was staffed (not always). Breakfast was delicious and plentiful. Hostess of The Patio, Sri, provided a warm welcome daily. Please note there is little on the menu after breakfast. (Although I enjoyed the chicken Caesar salad I had one afternoon.) I’m not sure the restaurant offers dinner (or room service). The Friday evening I was there, the kitchen was closed at 19:30. Very disappointing and inconvenient. The “Open Bar” offers many “grab and go” beverages (adult and other), but the food selection is commercial snack food only.

    Short Vacation in Amsterdam. The position of the hotel is right in the centre of Amsterdam, close to many attractions and restaurants. The rooms are comfortable, but rather small, and clean. One negative point, in the afternoon with the sun shining the room became very warm and the air conditioning took several hours to cool the room. Breakfast was very good with a wide variety to choose from. Personnel at the reception and at the breakfast were very pleasant and helpful. Certainly a positive experience to be repeated.

    ...

Now let's build the GPT analyzer.

The working principle is similar, but instead of reviews URL and the number of pages, the class will receive the reviews content, build a payload, send it to GPT API and print the result.

We will use the following GPT prompt for our request

  
    Summarize the reviews by Positives and Negatives in bullet points. Perform the Sentiment Analysis.

Here is our GPT class

  
  require 'rest_client'
  require 'json'

  class GPTAnalyzer
    API_KEY = ''

    attr_reader :reviews_content, :result

    def initialize(reviews_content)
      @reviews_content = reviews_content
    end

    def perform
      response = RestClient::Request.execute(
        method: :post,
        payload: payload.to_json,
        url: 'https://api.openai.com/v1/chat/completions',
        headers: {
          "Content-type" => "application/json",
          "Authorization" => "Bearer #{API_KEY}"
        },
      ).body

      analysis = JSON.parse(response)

      @result = analysis.dig('choices', 0, 'message', 'content')
    end

    private

    def payload
      {
        model: "gpt-3.5-turbo",
        messages: [
          {
            role: "system",
            content: "Summarize the reviews by Positives and Negatives in bullet points. Perform the Sentiment Analysis."
          },
          {
            role: "user",
            content: reviews_content
          }
        ]
      }
    end
  end

You can test the GPT analyzer by updating the API_KEY

  
    API_KEY = 'Your OpenAI API key'

and running

  
    reviews_content = <<-TEXT
      Great location & spacious rooms. We enjoyed our stay finding the rooms and facilities very comfortable and the location very convenient to walk about the city and get to the places we wanted to visit. We were surprised and grateful that the hotel gave us access to a room way before the check-in time (am) so we could take our bags to our room and head out for the day. My very tall husband appreciated the space in the room and shower after constantly hitting his head at our previous tiny accommodation.

      Know what to expect and you will enjoy. Very nice hotel for a short stay. Great location for walking to shopping, restaurants, bars, and other sights. Front desk staff were pleasant and helpful when the desk was staffed (not always). Breakfast was delicious and plentiful. Hostess of The Patio, Sri, provided a warm welcome daily. Please note there is little on the menu after breakfast. (Although I enjoyed the chicken Caesar salad I had one afternoon.) I’m not sure the restaurant offers dinner (or room service). The Friday evening I was there, the kitchen was closed at 19:30. Very disappointing and inconvenient. The “Open Bar” offers many “grab and go” beverages (adult and other), but the food selection is commercial snack food only.

      Short Vacation in Amsterdam. The position of the hotel is right in the centre of Amsterdam, close to many attractions and restaurants. The rooms are comfortable, but rather small, and clean. One negative point, in the afternoon with the sun shining the room became very warm and the air conditioning took several hours to cool the room. Breakfast was very good with a wide variety to choose from. Personnel at the reception and at the breakfast were very pleasant and helpful. Certainly a positive experience to be repeated.

    TEXT

    gpt = GPTAnalyzer.new(reviews_content)
    gpt.perform

    puts gpt.result

Now let's glue everything together

  
  require 'rest_client'
  require 'json'

  class Page2APIParser
    API_KEY = 'Your Page2API API key'

    attr_reader :url, :pages, :reviews_content

    def initialize(url, pages)
      @url = url
      @pages = pages
    end

    def perform
      response = RestClient::Request.execute(
        method: :post,
        payload: payload.to_json,
        url: 'https://www.page2api.com/api/v1/scrape',
        headers: { "Content-type" => "application/json" },
      ).body

      reviews = JSON.parse(response)

      compacted_reviews = reviews.map do |review|
        title = review['title'].gsub('…', '')
        content = review['content']

        content.include?(title) ? content : "#{title}. #{content}"
      end

      @reviews_content = compacted_reviews.join("\n\n")
    end

    private

    def payload
      {
        api_key: API_KEY,
        batch: {
          urls: reviews_urls,
          concurrency: 1,
          merge_results: true
        },
        raw: {
          key: "reviews"
        },
        real_browser: true,
        premium_proxy: "us",
        wait_for: "[data-reviewid]",
        parse: {
          reviews: [
            {
              _parent: "[data-reviewid]",
              title: "[data-test-target=review-title] >> text",
              content: "[data-test-target=review-title] + div > div div span >> text"
            }
          ]
        }
      }
    end

    def reviews_urls
      return [url] if pages == 1

      (0..(10 * pages - 1)).step(10).to_a.map do |offset|
        url.gsub('-Reviews-', "-Reviews-or#{offset}-")
      end
    end
  end

  class GPTAnalyzer
    API_KEY = 'Your OpenAI API key'

    attr_reader :reviews_content, :result

    def initialize(reviews_content)
      @reviews_content = reviews_content
    end

    def perform
      response = RestClient::Request.execute(
        method: :post,
        payload: payload.to_json,
        url: 'https://api.openai.com/v1/chat/completions',
        headers: {
          "Content-type" => "application/json",
          "Authorization" => "Bearer #{API_KEY}"
        },
      ).body

      analysis = JSON.parse(response)

      @result = analysis.dig('choices', 0, 'message', 'content')
    end

    private

    def payload
      {
        model: "gpt-3.5-turbo",
        messages: [
          {
            role: "system",
            content: "Summarize the reviews by Positives and Negatives in bullet points. Perform the Sentiment Analysis."
          },
          {
            role: "user",
            content: reviews_content
          }
        ]
      }
    end
  end



  reviews_url = ARGV[0] || raise('The reviews URL was not provided!')
  pages = ARGV[1].to_i.nonzero? || 1

  page2api = Page2APIParser.new(reviews_url, pages)
  page2api.perform

  gpt = GPTAnalyzer.new(page2api.reviews_content)
  gpt.perform

  puts gpt.result

Let's run the script

  
    $ ruby gpt.rb https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews-NH_City_Centre_Amsterdam-Amsterdam_North_Holland_Province.html 2

The result must look like the following one

  
    Positives:
    - Great hotel in city center
    - Excellent stay
    - Helpful and friendly staff
    - Spacious rooms
    - Convenient location
    - Good breakfast
    - Clean and well-maintained

    Negatives:
    - Rooms could be better organized
    - Some issues with check-in and room allocation
    - Limited menu options for lunch and dinner
    - Some issues with room temperature and air conditioning

    Sentiment Analysis:
    Overall, the majority of reviews are positive, with guests praising the hotel's location, friendly staff, spacious rooms, and good breakfast.
    Some guests had minor complaints about room organization, check-in issues, and limited menu options.
    However, these issues were outweighed by the positive experiences of most guests.

Conclusion

In conclusion, scraping Tripadvisor reviews with Page2API can be a powerful tool for businesses looking to improve their online reputation and customer acquisition.

By leveraging the code examples provided in this blog post, you can easily extract and summarize large volumes of review data from Tripadvisor using various programming languages.
Additionally, by performing sentiment analysis on this data, you can gain valuable insights into customer feedback and identify areas of improvement.

With the help of AI and natural language processing techniques, businesses can better understand their customers and make data-driven decisions to improve their products and services.

In addition to analyzing TripAdvisor reviews, the power of AI extends to other domains such as news aggregation and summarization.
For instance, our comprehensive guide on How to Scrape News Articles and Summarize the Content with AI demonstrates a similar approach to extracting and condensing information from news articles.

This technique can be particularly beneficial for professionals who need to stay updated with the latest news but have limited time.

We hope that this tutorial has provided you with a better understanding of how to scrape and analyze Tripadvisor reviews, and how to leverage these insights to improve your business.

How to Scrape Tripadvisor Reviews and Perform Sentiment Analysis with AI

Introduction

Prerequisites

How to scrape Tripadvisor reviews

How to summarize the reviews and perform the Sentiment Analysis with AI (GPT-3.5-turbo)

Conclusion

You might also like

How to Download Instagram Videos with iPhone Shortcuts

How to Scrape News Articles and Summarize the Content with AI

How to Scrape Trustpilot Reviews and Perform Sentiment Analysis with AI

What customers are saying

Ready to Scrape the Web like a PRO?